Abstract

This paper describes the implementation and evaluation 
of an operating system module, the Congestion Manager 
(CM), that provides integrated network flow manage- 
ment and exports a convenient programming interface 
that allows applications to be notified of, and adapt to, 
changing network conditions. We describe the API by 
which applications interface with the CM, and the ar- 
chitectural considerations that factored into the design. 
To evaluate the architecture and API, we first describe 
our implementation of TCP, a streaming layered au- 
dio/video application, and an interactive audio applica- 
tion using the CM, and show that they achieve adaptive 
behavior without incurring much end-system overhead. 
All flows including TCP benefit from the sharing of con- 
gestion information, and applications are able to incor- 
porate new functionality such as congestion control and 
adaptive behavior. 

1 Introduction 

The impressive scalability of the Internet infrastructure 
is in large part due to a design philosophy that advo- 
cates a simple architecture for the core of the network, 
with most of the intelligence and state management im- 
plemented in the end systems [9]. The service model 
provided by the network substrate is therefore primar- 
ily a "best-effort" one, which implies that packets may 
be lost, reordered or duplicated, and end-to-end delays 
may be variable. Congestion and accompanying packet 
loss are common in heterogeneous networks like the In- 
ternet because of overload, when demand for router re- 
sources, such as bandwidth and buffer space, exceeds 
what is available. Thus, end systems in the Internet 
must incorporate mechanisms for detecting and react- 
ing to network congestion, probing for spare capacity 
when the network is uncongested, as well as managing 
their available bandwidth effectively. 

Previous work has demonstrated that uncontrolled
congestion leads to a phenomenon commonly called
"congestion collapse" [7, 12]. Congestion collapse is
largely alleviated today because the popular end-to-end
Transmission Control Protocol (TCP) [30, 40] incorporates
sound congestion avoidance and control algorithms.
However, while TCP does
implement congestion control [17], many applications 
including the Web [5, 11] use several logically differ- 
ent streams in parallel, using multiple concurrent TCP 
connections between the same pair of hosts. As several 
researchers have shown [2, 3, 27, 28, 42], these con- 
current connections compete with - rather than learn 
from - each other about network conditions to the same 
receiver, and end up being unfair to other applications 
that use fewer connections. The ability to share conges- 
tion information between concurrent flows is therefore 
a useful feature, one that promotes cooperation among 
different flows rather than adverse competition. 

Another important trend in today's Internet is the 
increasing number of applications that do not use TCP 
as their underlying transport, because of the constrain- 
ing reliability and ordering semantics imposed by its 
in-order byte-stream abstraction. Streaming audio and 
video [24, 34, 41] and customized image transport proto- 
cols for JPEG-like formats, where portions of an image 
can be rendered out-of-order, are important examples. 
Such applications use custom protocols that run over 
the User Datagram Protocol (UDP) [29], often with- 
out implementing any form of congestion control. The 
unchecked proliferation of such applications will have 
a significant adverse effect on the stability of the net- 
work [3, 7, 12]. 

A majority of Internet applications deliver HTML 
documents and images or stream audio and video to 
end users and are interactive in nature. A simple but 
useful figure-of-merit for interactive content delivery is 
the end-to-end download latency; users typically wait 
no more than a few seconds before aborting a transfer. 
Therefore, it would be beneficial for content providers 
to adapt what they disseminate to the state of the net- 
work, so as not to exceed a threshold latency. For- 
tunately, such content adaptation is possible for most 
applications. Streaming audio and video applications 
typically encode information in a range of formats
corresponding to different encoding (transmission) rates
and degrees of loss resiliency. Image encoding formats
accommodate a range of qualities to suit a variety of 
client requirements. 

Today, the implementor of an Internet content dis- 
semination application has a challenging task: For her 
application to be safe for widespread Internet deploy- 
ment, it must either use TCP, suffering the conse- 
quences of its fully-reliable, byte-stream abstraction, or 
use an application-specific protocol over UDP, but
attempt to implement congestion control in it, reinventing
this machinery and risking getting it wrong. Furthermore,
neither alternative allows for sharing congestion
information across flows. In addition, as an undesirable
side effect of the layered protocol stack, the common
application programming interface (API) classes (Berkeley
sockets, streams, and Winsock [31]) do not expose any
information about the state of the network in a standard
way to applications 1 . This makes
it difficult for applications running on existing end host 
operating systems to make an informed decision, taking 
network variables into account, during content adapta- 
tion. 

This paper describes the implementation and eval- 
uation of an end-host operating system module and its 
API that enables network-adaptive applications. Here, 
we build on our recent proposal for a Congestion Man- 
ager (CM) [3], an end-system architecture for shar- 
ing congestion information between multiple concurrent 
flows. The advantage of the CM is that it moves the
task of performing sound congestion control to a trusted
kernel module, freeing transport protocols and applications
from having to re-implement it and ensuring that
the ensemble of concurrent flows is not overly aggressive
to the network.

While our previous work provided the rationale for 
such an approach and laid out an initial design for the 
CM, this paper details the design and implementation 
of the CM architecture and describes how it collects 
end-to-end information about the network and provides 
it to applications through its API. We show using spe- 
cific case studies how applications can adapt their trans- 
missions to changing network conditions using the CM 
API. Conceptually, our system is based on feedback in 
the form of callbacks from the in-kernel CM to an ap- 
plication that are used to orchestrate its transmissions. 
We show that our implementation of callbacks does not 
affect performance or scalability of a data sender. 

To our knowledge, this is the first implementation 
of a general application-independent system that com- 
bines integrated flow management with a convenient 
API for content adaptation. The end-result is that ap- 
plications achieve the desirable congestion control prop- 
erties of long-running TCP connections, together with 
the flexibility to adapt data transmissions to prevailing 
network conditions. 



1 Utilities like netstat and ifconfig provide some information
about devices, but not end-to-end performance
information that can be used for adapting content.



Another important contribution of this paper is to 
demonstrate that the CM extensions are useful even 
when applied to the sender side alone, requiring no 
changes to the data receiver. Because most robust con- 
gestion control algorithms rely on receiver feedback, it is 
natural to expect that a CM receiver is needed to inform 
the CM sender of successful transmissions and packet 
losses. However, to facilitate deployment, we have de- 
signed our system to take advantage of the fact that 
several protocols including TCP and other applications 
already incorporate some form of application-specific 
feedback. 

We demonstrate the benefits of integrated flow man- 
agement and show how applications can adapt their 
transmissions using callbacks initiated by the kernel. 
We do this by answering the following key questions in 
this paper: Does the CM provide a convenient interface 
for applications such as streaming layered video/audio, 
real-time audio, and TCP to adapt without placing a 
significant burden on developers? What information is 
needed from applications for the correct functioning of 
the system? In today's off-the-shelf operating systems, 
does the CM place any performance limitations upon 
applications? 

We answer these questions based on our implemen- 
tation of the CM in the Linux operating system and 
its measured performance over a variety of TCP and 
UDP applications. We find that our implementation 
of TCP (which uses the CM for its congestion control) 
has essentially the same performance as standard TCP, 
with the added benefits of integrated congestion management
across flows. Our implementation of a layered
streaming audio/video application demonstrates
that the CM architecture can be used to implement highly
adaptive congestion-controlled applications. Adaptation
via the CM API helps these applications achieve
better performance and also be fair to other flows on 
the Internet. We find that the API introduces a negligible
to observable CPU overhead (between 2.5% and 18.5%)
depending on the type of callback used, and
that the higher CPU utilizations occur when the application
desires very fine-grained information about the network
on a per-packet basis. We do not believe this is a bad
trade-off, especially because our mid-range PC sender
(350 MHz Pentium) easily saturates a 100 Mbps Ethernet
despite this overhead. We have also CM-enabled a
legacy application (the Internet audio tool vat from the
MASH toolkit [22]) to perform adaptive real-time delivery.
Since fewer than one hundred lines of source code
had to be modified to CM-enable this complex
application and make it adapt to network conditions,
we believe it demonstrates the ease with which the CM
makes applications adaptive.

The rest of this paper is organized as follows. Sec- 
tion 2 describes our system architecture and implemen- 
tation. Section 3 describes how network-adaptive appli- 
cations can be engineered using the CM, while Section 4 
presents results of several experiments. In Section 5,
we discuss some miscellaneous details and open issues
in the CM architecture. We survey related work in Sec- 
tion 6 and conclude with a summary in Section 7. 

2 System Architecture and Implementation

The CM performs two important functions. First, it en- 
ables efficient multiplexing and congestion control by in- 
tegrating congestion management across multiple flows. 
Second, it enables efficient application adaptation to 
congestion by exposing its knowledge of network con- 
ditions to applications. Most of the CM functionality 
in our Linux implementation is in-kernel; this choice 
makes it convenient to integrate congestion manage- 
ment across both TCP flows and other user-level pro- 
tocols, since TCP is implemented in the kernel. 

To perform efficient aggregation of congestion infor- 
mation across concurrent flows, the CM has to identify 
which flows potentially share a common bottleneck link 
en route to various receivers. In general, this is a diffi- 
cult problem, since it requires an understanding of the 
paths taken by different flows. However, in today's In- 
ternet, all flows destined to the same end host take the 
same path in the common case, and we use this group 
of flows as the default granularity of flow aggregation 2 . 
We call this group a macroflow: a group of flows that 
share the same congestion state, control algorithms, and 
state information in the CM. Each flow has a sending 
application that is responsible for its transmissions; we 
call this a CM client. CM clients are in-kernel protocols 
like TCP or user-space applications. 

The CM incorporates a congestion controller that
performs congestion avoidance and control on a
per-macroflow basis. It uses a window-based algorithm
that mimics TCP's additive-increase/multiplicative-decrease
(AIMD) scheme to ensure fairness to other TCP
flows on the Internet. However, the modularity pro- 
vided by the CM encourages experimentation with 
other non-AIMD schemes that may be better suited to 
specific data types such as audio or video. 
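
As a rough illustration of the window-based AIMD behavior
described above, the following Python sketch tracks the window
of a single macroflow. The class name, the increase of about one
MTU per window of acknowledged data, and the halving on loss are
assumptions borrowed from common TCP practice; they are not
taken from the CM implementation.

```python
class AimdWindow:
    """Toy per-macroflow congestion window in bytes (hypothetical names)."""

    def __init__(self, mtu=1500, initial_mtus=1):
        self.mtu = mtu
        self.cwnd = float(mtu * initial_mtus)

    def on_ack(self, acked_bytes):
        # Additive increase: roughly one MTU per window of acknowledged data.
        self.cwnd += self.mtu * acked_bytes / self.cwnd

    def on_loss(self):
        # Multiplicative decrease: halve the window, never below one MTU.
        self.cwnd = max(self.cwnd / 2.0, float(self.mtu))


window = AimdWindow(mtu=1000, initial_mtus=4)
window.on_ack(4000)   # one full window acknowledged
window.on_loss()      # one congestion event
print(window.cwnd)
```

A non-AIMD scheme would replace only on_ack() and on_loss(),
which is the kind of experimentation the modular design is
meant to allow.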

While the congestion controller determines what the 
current window (rate) ought to be for each macroflow, 
a scheduler decides how this is apportioned among the 
constituent flows. Currently, our implementation uses a 
standard weighted round-robin scheduler whose weights 
are settable by an administrator. 
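
The following Python sketch shows one way a weighted round-robin
scheduler could apportion grants among the flows of a macroflow.
It is a simplified model, not the CM scheduler itself: the class
and method names are hypothetical, weights are assumed to be
positive integers, and each grant stands for permission to send
up to one MTU.

```python
from collections import deque


class WeightedRoundRobin:
    """Toy scheduler: each flow gets up to `weight` grants per round."""

    def __init__(self, weights):
        self.weights = dict(weights)             # flow id -> integer weight >= 1
        self.pending = {f: 0 for f in weights}   # outstanding requests per flow
        self.order = deque(weights)              # fixed visiting order
        self.credit = dict(self.weights)         # grants left in the current round

    def request(self, flow):
        self.pending[flow] += 1

    def next_grant(self):
        """Return the next flow allowed to send, or None if nothing is pending."""
        if not any(self.pending.values()):
            return None
        while True:
            flow = self.order[0]
            if self.pending[flow] and self.credit[flow]:
                self.pending[flow] -= 1
                self.credit[flow] -= 1
                return flow
            # Flow is idle or out of credit: refill it and visit the next flow.
            self.credit[flow] = self.weights[flow]
            self.order.rotate(-1)
```

With weights of 2 and 1, a busy pair of flows shares grants in
a 2:1 ratio, and an idle flow's share is simply passed on to the
flows that still have requests pending.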

In-kernel CM clients such as a TCP sender use CM 
function calls to transmit data and learn about net- 
work conditions and events. In contrast, user-space 
clients interact with the CM using a portable,
platform-independent API described in Section 2.1. A
platform-dependent CM library, libcm, is responsible for
interfacing between the kernel and these clients, and is
described in Section 2.2. These components are shown in
Figure 1.

2 This is not strictly true in the presence of network-layer
differentiated services. We address this issue later in this
section and in Section 5.

Figure 1. Architecture of the congestion manager at
the data sender, showing the CM library and the CM.
The dotted arrows show callbacks, and solid lines show
the datapath. UDP-CC is a congestion-controlled UDP
socket implemented using the CM.

When a client opens a CM-enabled socket, the CM 
allocates a flow to it and assigns the flow to the appro- 
priate macroflow based on its destination. The client 
initiates data transmission by requesting permission to 
send data. At some point in the future depending on the 
available rate, the CM issues a callback permitting the 
client to send data. The client then transmits data, and 
tells the CM it has done so. When the client receives 
feedback from the receiver about its past transmissions, 
it notifies the CM about these and continues. 

When a client makes a request to send on a flow, the 
scheduler checks whether the corresponding macroflow's 
window is open. If so, the request is granted and the 
client notified, upon which it may send some data. 
Whenever any data is transmitted, the sender's IP layer 
notifies the CM, allowing it to "charge" the transmis- 
sion to the appropriate macroflow. When the client re- 
ceives feedback from its remote counterpart, it informs 
the CM of the loss rate, number of bytes transmitted 
correctly, and the observed round trip time. On a suc- 
cessful transmission, the CM opens up the window ac- 
cording to its congestion management algorithm and 
attempts to grant a pending request on a flow asso- 
ciated with this macroflow. The scheduler also has a 
timer-driven component to perform background tasks 
and error handling. 
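
The request, grant, notify, and update steps described above can
be modelled in a few lines of Python. This is a sketch under
several assumptions: there is a single macroflow, the window
opens by one MTU on every loss-free update, and the class ToyCM
and its simplified method signatures are hypothetical stand-ins
for the real CM calls.

```python
from collections import deque


class ToyCM:
    """Single-macroflow model of the request/grant/notify/update cycle."""

    def __init__(self, window, mtu=1000):
        self.window = window      # bytes the macroflow may have in flight
        self.outstanding = 0      # bytes sent but not yet acknowledged
        self.granted = 0          # bytes granted but not yet reported as sent
        self.mtu = mtu
        self.pending = deque()    # (flow id, callback) pairs awaiting a grant

    def cm_request(self, flow, cmapp_send):
        self.pending.append((flow, cmapp_send))
        self._try_grant()

    def cm_notify(self, flow, nsent):
        # Models the IP output hook: settle one grant and charge nsent bytes.
        # nsent == 0 means the client declined, which frees the grant.
        self.granted -= self.mtu
        self.outstanding += nsent
        self._try_grant()

    def cm_update(self, flow, nrecd, lost=False):
        self.outstanding -= nrecd
        if not lost:
            self.window += self.mtu   # simplistic window opening (assumption)
        self._try_grant()

    def _try_grant(self):
        # Grant pending requests while one more MTU fits in the window.
        while self.pending and (
            self.outstanding + self.granted + self.mtu <= self.window
        ):
            flow, callback = self.pending.popleft()
            self.granted += self.mtu
            callback(flow)
```

A client callback in this model sends its data and then calls
cm_notify(); a third request issued against a two-MTU window
stays queued until a cm_update() reports a successful
transmission.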



2.1 CM API 

The CM API is specified as a set of functions and call- 
backs which a client uses to interface with the CM. It 
specifies functions for managing state, for performing 
data transmissions, for applications to inform the CM 
of losses, for querying the CM about network state, and 
for constructing and splitting macroflows if the default 
per-destination aggregation is unsuitable for an appli- 
cation. 



2.1.1 State management 

All CM applications call cm_open() before using the 
CM, passing the source and destination addresses, 
transport-layer port numbers, and protocol numbers 
as arguments. This returns a CM flow identifier 
(cm_flowid), which is used as a handle for all future 
CM calls. When a flow terminates, the application is 
expected to call cm_close(cm_flowid) to clean up internal
state. If a flow has been inactive for a pre-configured
amount of time, its state in the CM is purged. Appli- 
cations can also use the cm_mtu() call to obtain the 
maximum transmission unit (MTU, the largest unfrag- 
mented datagram size) to a destination. Inside the 
CM, this is either pre-configured or obtained using path 
MTU discovery to the receiver and cached [25].

2.1.2 Data transmission 

There are three ways in which an application can use 
the CM to transmit data. These allow a variety of adap- 
tation strategies, depending on the nature of the client 
application and its software structure. 

(i) Buffered send. This API is similar to a conventional
blocking write() call, but the resulting
data transmission is paced by the Congestion
Manager. We use this to implement a generic 
congestion-controlled UDP socket (without con- 
tent adaptation), useful for bulk transmissions 
that do not require TCP-style reliability or fine- 
grained control over what data gets sent at any 
point in time. While this is useful for bulk data, it 
is not convenient for an adaptive sender that might 
want to revisit and change prior transmission de- 
cisions once packets are buffered in the CM, when 
it learns that network conditions have appreciably 
changed. 

(ii) Request/callback. This is the preferred mode
of communication for adaptive senders that are
based on the application-level framing (ALF) principle.
Here, the client does not send any data via the CM;
rather, it makes a cm_request(cm_flowid) call and
expects a notification (implemented as a well-known
cmapp_send() client callback) at some point in the
future when this request is granted by the CM.
This approach puts the sender in firm control of
deciding what to transmit at any point in time
and allows it to track fine-grained changes in
available bandwidth. It allows the sender to adapt to
sudden changes in network performance, which is
hard to do in a conventional buffered transmission
API. The client callback is a grant for the flow
to send up to MTU bytes of data. Observe that
cm_request() does not take the number of bytes
or MTU-sized units as an argument; each call to
cm_request() is an implicit request for sending
up to MTU bytes. This simplifies the internal
implementation of the CM scheduler and congestion
controller at the expense of a slightly more complex
interface. This API is ideally suited for an
implementation of TCP, since it needs to make a
decision at each stage about whether to retransmit
a segment or send a new one.

(iii) Rate callback. A self-timed application (e.g.,
vat, which samples periodically from the audio device)
transmitting on a fixed schedule may receive
callbacks from the CM notifying it when
the parameters of its communication channel have
changed, so that it can change the frequency of
its timer loop or its packet size. Such clients do
not use the request/callback API; if they want
their transmissions time-synchronized, they do it
themselves. Instead, the CM provides the necessary
information via the cmapp_update() callback, which
informs the client of the current rate available to it,
the round-trip time, and the packet loss rate along the
path. The client registers a callback threshold using
the cm_thresh(down, up) call; if the rate decreases
by a factor of down or increases by a factor
of up, the CM calls cmapp_update(). This transmission
API is ideally suited for streaming layered
audio and video.
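
To make the rate callback threshold concrete, the Python sketch
below fires a callback only when the rate has moved by the
registered factors. It assumes that the comparison baseline is
the rate most recently reported to the client, which the text
does not specify; the class RateNotifier and the method
rate_changed() are hypothetical.

```python
class RateNotifier:
    """Fires cmapp_update when the rate crosses the registered thresholds."""

    def __init__(self, initial_rate, cmapp_update):
        self.reported = initial_rate     # rate last delivered to the client
        self.down = None                 # no thresholds until cm_thresh() is called
        self.up = None
        self.cmapp_update = cmapp_update

    def cm_thresh(self, down, up):
        self.down, self.up = down, up

    def rate_changed(self, rate, rtt, loss):
        """Called by the (modelled) CM whenever its rate estimate changes."""
        if self.down is None:
            return
        dropped = rate <= self.reported / self.down
        rose = rate >= self.reported * self.up
        if dropped or rose:
            self.reported = rate
            self.cmapp_update(rate, rtt, loss)
```

A layered streaming sender could use the callback to add or drop
a layer, while small fluctuations between the two thresholds
cause no callbacks at all.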

2.1.3 Application notifications 

One of the goals of our work was to investigate a
CM implementation that requires neither kernel changes
nor additional software at receivers on the Internet.
Since performing congestion management requires
feedback about transmissions, the CM provides
clients with functions to provide it with feedback.
A client calls cm_update(cm_flowid, nsent, nrecd,
lossmode, rtt) to inform the CM about the number
of sent and received packets, type of congestion loss if
any, and a round-trip time sample. The CM distinguishes
between "persistent" congestion, as would occur
on a TCP timeout, and "transient" congestion, when
only one packet in a window is lost. It also allows
congestion to be signaled using Explicit Congestion
Notification (ECN) [32], which uses packet markings rather
than drops to indicate congestion.

To perform accurate bookkeeping of the congestion
window and outstanding bytes for a macroflow,
the CM needs to know of each successful transmission
from the host. Rather than burden clients with reporting
this information, we modify the IP output routine to
call cm_notify(cm_flowid, nsent) on each transmission.
(The IP layer obtains the cm_flowid using a well-defined
CM interface that takes the flow parameters
(addresses, ports, protocol field) as arguments.) However,
if a client decides not to transmit any data upon a
cmapp_send() callback invocation, it is expected to call
cm_notify(dst, 0) to allow the CM to permit some
other flows on the macroflow to transmit data.

2.1.4 Querying 

If a client wishes to learn about its (per-flow) available
bandwidth and round-trip time, it can use the
cm_query() call that returns these quantities. This 
is especially useful at the beginning of a stream when 
clients can make an informed decision about the data 
encoding to transmit (e.g., a large color or smaller grey- 
scale image). 
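
As an illustration of using queried values at the start of a
stream, the Python sketch below picks the largest encoding that
is expected to arrive within a latency budget. The latency model
(one round trip plus size divided by bandwidth), the units, and
the function name are assumptions made for this example; the
bandwidth and round-trip time stand for the quantities returned
by cm_query().

```python
def choose_encoding(encodings, bandwidth_bps, rtt_s, max_latency_s):
    """Pick the largest encoding expected to arrive within max_latency_s.

    encodings maps a label to its size in bytes. If no encoding fits,
    the smallest one is returned as a fallback.
    """
    def transfer_time(size_bytes):
        # Crude estimate: one round trip plus serialization time.
        return rtt_s + size_bytes * 8 / bandwidth_bps

    by_size = [(size, name) for name, size in encodings.items()]
    fitting = [item for item in by_size if transfer_time(item[0]) <= max_latency_s]
    if fitting:
        return max(fitting)[1]
    return min(by_size)[1]


images = {"color": 400_000, "greyscale": 100_000}
print(choose_encoding(images, bandwidth_bps=1_000_000, rtt_s=0.1, max_latency_s=2.0))
```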

2.1.5 Macroflow management

One of the decisions the CM needs to make is the
granularity at which a macroflow is constructed, by
deciding which flows belong to a single macroflow and
share congestion information. While the default is
per-destination sharing, the CM API also provides two
functions that allow applications to decide which of their
streams ought to belong (or not belong) to the same
macroflow.

cm_getmacroflow(cm_flowid) returns a unique macroflow
identifier, while cm_setmacroflow(cm_macroflowid, cm_flowid)
sets the macroflow of the flow cm_flowid to
cm_macroflowid. If the cm_macroflowid that is
passed to cm_setmacroflow() is -1, then a new
macroflow is constructed and this is returned to the
caller. Each call to cm_setmacroflow() overrides the
previous macroflow association for the flow, should
one exist. We expect this API to become useful as the
CM starts getting deployed over networks with service
differentiation, such as differentiated services.

2.2 libcm: The CM library 

The CM library provides the user with the convenience 
of a callback-based API while separating them from the 
details of how the kernel to user callbacks are imple- 
mented. While direct function callbacks are convenient 
and efficient in the same address space, as is the case 
when the kernel TCP is a client of the CM, callbacks 
from the kernel to user code in conventional operating 
systems are more difficult. A key decision in the imple- 
mentation of libcm was choosing a kernel/user interface 
that maximizes portability, and minimizes both perfor- 
mance overhead and the difficulty of integration with 
existing applications. The resulting interface is: 



1. Call select() on a single per-application CM control
socket. The write bit indicates that a flow may 
send data, and the exception bit indicates that 
network conditions have changed. 

2. Perform an ioctl to extract a list of all flow IDs 
which may send, or to receive the current network 
conditions for a flow. 



2.2.1 Implementation alternatives 

We considered a number of mechanisms with which to 
implement libcm. In this section, we discuss our rea- 
sons for choosing the control-socket+select+ioctl ap- 
proach. 

While much research has focused on reducing the 
cost of crossing the user/kernel boundary (extensible 
kernels in SPIN [6], fast, generic IPC in Mach [4], etc.),
many conventional operating systems remain limited 
to more primitive methods for kernel-to-user notifica- 
tion, each with their own advantages and disadvan- 
tages. While functionality like the Mach port set-based 
IPC would be ideal for our purposes, pragmatically we 
considered four common mechanisms for kernel-to-user
communication: signals, system calls, semaphores, and
sockets. A discussion of the merits of each follows. 

Signals have several immediate drawbacks. First, 
if the CM were to appropriate an existing signal for 
its own use, it might conflict with an application us- 
ing the same signal. Avoiding this conflict would re- 
quire the standardization of a new signal type, a pro- 
cess both slow and of questionable value, given the ex- 
istence of better alternatives. Second, the cost to an 
application to receive a signal is relatively high, and 
some legacy applications may not be signal-safe. While 
the new POSIX 1003.1b [16] soft realtime signals allow 
delivering a 32-bit quantity with a signal, applications 
would need to follow up a signal with a system call to 
obtain all of the information the kernel wished to de- 
liver, since multiple flows may become ready at once. 
For these reasons, we consider mandating the use of 
signals the wrong course for implementing the kernel 
to user callbacks. However, we provide an option for 
processes to receive a SIGIO when their control socket 
status changes, akin to POSIX asynchronous I/O. 

System calls that block do not integrate well with 
applications that already have their own event loop, 
since without polling, applications cannot wait on the 
results of multiple system calls. A system call is able 
to return immediately with the data the user needs, 
but the impediments it poses to application integration 
are large. System calls would work well in a threaded 
environment, but this presupposes threading support, 
and the select-based mechanism we describe below can 
be used in a threaded system without major additional 
overhead. 

Semaphores suffer from the immediate drawback 
that they are not commonly used in network applications.
For an application that uses semop on an array
of semaphores as its event loop, a CM semaphore
might be the best implementation avenue, for many of 
the same reasons that we chose sockets for network- 
adaptive applications. However, most network appli- 
cations use socket sets instead of semaphore sets, and 
sockets have a few other benefits, which we discuss next. 

Sockets provide a well-defined and flexible interface
for applications in the form of the select() system
call, though they have a downside similar to that of
signals: an application wishing to receive a notification
via a socket in a non-blocking manner must select()
on the socket, and then perform a system call to obtain 
data from the socket. However, a select-based inter- 
face meshes well with many network applications that 
already have a select-loop based architecture. Utiliz- 
ing a control socket also helps restrict the code changes 
caused by the CM to the networking stack. 

Finally, we decided to use a single control socket 
instead of one control socket per flow to avoid unnecessary
overhead in applications with large numbers of
open socket descriptors, such as select()-based web
servers and caches. Because some aspects of select scale
linearly with the number of descriptors, and many op- 
erating systems have limits on the number of open de- 
scriptors, we deemed doubling the socket load for high- 
performance network applications a bad idea. 

2.2.2 Extracting data from the socket 

Select provides notification that "some event" has occurred.
In theory, seven different events could be signaled by
abusing the read, write, and exception bits, but applications
need to extract more information than this. The
CM provides two types of callbacks. Generally speak- 
ing, the first is a "permission to send" callback for a 
particular flow. To maintain fairness, a loose order- 
ing should be preserved with these messages, but exact 
ordering is unimportant provided no flows are ignored 
until the application receives further updates (thereby 
starving the flows). If multiple permission notifications
occur, the application should receive all of them so it 
can send data on all available flows. The second call- 
back is a "status changed" notification. If multiple sta- 
tus changes occur before the application obtains this 
data from the kernel, then only the current status mat- 
ters. 

The weak ordering and lack of history prompted us 
to choose an ioctl-based query instead of a read
or message queue interface, minimizing the state that 
must be maintained in the kernel. Status updates sim- 
ply return the current CM-maintained network state 
estimate, and "who can send" queries perform a select- 
like operation on the flows maintained by the kernel, re- 
quiring no extra state, instead of a potentially expensive 
per-process message queue or data stream. Returning 
all available flows has an added benefit of reducing the 
number of system calls which must be made if several
flows become ready simultaneously.
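
The "no history" property of the two queries can be illustrated
with a small Python model. It is only a sketch: the class and
method names are hypothetical, and the explicit ready set is a
simplification, since the text describes the "who can send"
query as being computed from existing per-flow state rather than
from a separately stored set.

```python
class ControlState:
    """Toy model of the state behind the control socket, kept without queues."""

    def __init__(self):
        self.can_send = set()   # flows with an unconsumed send grant
        self.status = {}        # flow id -> latest (rate, rtt, loss) only

    def grant(self, flow):
        # Repeated grants for one flow collapse into a single ready flag.
        self.can_send.add(flow)

    def status_changed(self, flow, rate, rtt, loss):
        # Overwrite in place: only the current status matters.
        self.status[flow] = (rate, rtt, loss)

    def query_ready(self):
        """Model of the 'who can send' query: return all ready flows at once."""
        ready, self.can_send = sorted(self.can_send), set()
        return ready

    def query_status(self, flow):
        return self.status[flow]
```

Because every ready flow is returned by a single query, an
application with several flows becoming ready together pays for
one call rather than one call per flow.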

3 Engineering Network-adaptive Applications

In this section, we describe several different classes of 
applications, and describe the ways those applications 
can make use of the CM. We explore two in-kernel 
clients, and several user-space data server programs, 
and examine the task of integrating each with the CM. 

3.1 Software Architecture Issues 

Typical network applications fall into one of several cat- 
egories: 

• Data-driven: Applications that transmit prespec- 
ified data, such as a single file, then exit. 

• Synchronous event-driven: Self-timed data deliv- 
ery servers, like streaming audio servers. 

• Asynchronous event-driven: File servers (http, 
ftp) and other network-clocked applications. 

The CM library provides several options for adap- 
tive applications that wish to make use of its services: 

1. Data-driven applications may use the buffered 
send API to efficiently pace their data transmis- 
sions. 

2. An application may operate in an entirely 
callback-based manner by allowing libcm to pro- 
vide its own event loop, calling into the applica- 
tion when flows are ready. This is most useful for 
applications coded with the CM in mind. 

3. Existing signal-driven applications may request a 
SIGIO notification from the CM when an event 
occurs.

4. Existing applications with select-based event loops 
can simply add the CM control socket into their 
select set, and call the libcm dispatcher when the 
socket is ready. Rate-clocked applications (or pure 
polling-based applications) can perform a simi- 
lar nonblocking select test on the descriptor when 
they awaken to send data, or, if they sleep, can 
replace the sleep with a timed blocking select 
call. 

5. Applications may poll the CM on their own sched- 
ule. 

6. Finally, applications may perform CM ioctl()s
directly, though this creates potential portability
and maintenance problems.
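
The select-based option in the list above can be sketched in
Python as follows. This is an emulation only: a local socket
pair stands in for the CM control socket, readability stands in
for the write and exception bits that the real control socket
uses, and libcm_dispatch() is a hypothetical placeholder for the
libcm dispatcher.

```python
import select
import socket

# Stand-ins: `control` plays the CM control socket and `kernel_side` the kernel.
control, kernel_side = socket.socketpair()
# An ordinary application socket that the existing event loop already watches.
data_sock, peer = socket.socketpair()


def libcm_dispatch(sock, handled):
    """Hypothetical dispatcher: drain one notification and record it."""
    handled.append(sock.recv(16))


def run_once(timeout, handled, received):
    """One pass of an existing select loop, extended with the control socket."""
    readable, _, _ = select.select([data_sock, control], [], [], timeout)
    for sock in readable:
        if sock is control:
            libcm_dispatch(control, handled)
        else:
            received.append(sock.recv(16))
    return len(readable)
```

Passing a timeout of 0 gives the nonblocking test suited to
rate-clocked applications, while a positive timeout corresponds
to replacing a sleep with a timed blocking select call.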

The remainder of this section describes how par- 
ticular clients use different CM APIs, from the low- 
bandwidth vat audio application, to the performance- 
critical kernel TCP implementation. 



3.2 TCP 

We implemented TCP as an in-kernel CM client. 
TCP/CM offloads all congestion control to the CM, 
while retaining all other TCP functionality (connection 
establishment and termination, loss recovery and pro- 
tocol state handling). TCP uses the request/callback 
API as low-overhead direct function calls in the same 
protection domain. This gives TCP the tight control 
it needs over packet scheduling. For example, while 
the arrival of a new acknowledgement typically causes 
TCP to transmit new data, the arrival of three dupli- 
cate ACKs instead causes TCP to retransmit an old 
packet. 

Connection creation. When TCP creates a new 
connection via either accept (inbound) or connect 
(outbound), it calls cm_open() to associate the TCP 
connection with a CM flow. Thereafter, the pacing 
of outgoing data on this connection is controlled by 
the CM. When application data becomes available, after 
performing all the non-congestion-related checks (e.g., 
the Nagle algorithm [40]), data is 
queued and cm_request() is called for the flow. When 
the CM scheduler schedules the flow for transmission, 
the cmapp_send() routine for TCP is called. The 
cmapp_send() for TCP transmits any retransmission 
from the retransmission queue. Otherwise, it transmits 
the data present in the transmit socket buffer by send- 
ing up to one maximum segment size of data per call. 
Finally, the IP output routine calls cm_notify() when 
the data is actually sent out. 
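
The send path just described can be sketched as a toy model in Python. The class name, the MSS value, and the queue representation are assumptions made for illustration; only the ordering (retransmissions first, then at most one segment of new data, then a notification) follows the text.

```python
from collections import deque

MSS = 1460  # assumed maximum segment size in bytes

class TcpFlowSketch:
    def __init__(self):
        self.retransmit_queue = deque()  # segments awaiting retransmission
        self.send_buffer = b""           # unsent application data
        self.notified = []               # byte counts reported via cm_notify

    def cmapp_send(self):
        """Called when the CM schedules this flow: send one segment."""
        if self.retransmit_queue:
            segment = self.retransmit_queue.popleft()
        else:
            segment = self.send_buffer[:MSS]
            self.send_buffer = self.send_buffer[MSS:]
        if segment:
            # In the paper the IP output routine makes this call.
            self.cm_notify(len(segment))
        return segment

    def cm_notify(self, nbytes):
        self.notified.append(nbytes)

flow = TcpFlowSketch()
flow.send_buffer = b"x" * 3000
flow.retransmit_queue.append(b"r" * 500)
sent = [len(flow.cmapp_send()) for _ in range(4)]
```

Here the 500-byte retransmission goes out first, followed by the buffered data in segments of at most one MSS.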

TCP input. The TCP input routines now feed 
back to the CM. Round trip time (RTT) sample collection 
is done as usual using either RFC 1323 timestamps [18] 
or Karn's algorithm [20] and is passed to the CM 
via cm_update(). The smoothed estimates of the RTT 
(srtt) and round-trip time deviation are calculated by 
the CM, which can now obtain a better average by com- 
bining samples from different connections to the same 
receiver. This is available to each TCP connection via 
cm_query(), and is useful in loss recovery. 
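
As a rough illustration of shared RTT estimation, the sketch below keeps one smoothed estimate per receiver and feeds it samples from more than one connection. The gains of 1/8 and 1/4 are the conventional TCP values and are an assumption here; the text does not state which filter the CM uses. The class and method names are illustrative.

```python
class SharedRttEstimator:
    """Smoothed RTT and deviation for one receiver, shared by all
    connections to it. Gains are assumed, not taken from the CM."""

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def cm_update(self, sample):
        if self.srtt is None:
            # First sample initializes both estimates.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # Update the deviation before the mean, using the old srtt.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample)
            self.srtt = 0.875 * self.srtt + 0.125 * sample

    def cm_query(self):
        return self.srtt, self.rttvar

est = SharedRttEstimator()
for sample in (0.100, 0.100):   # samples from connection A
    est.cm_update(sample)
est.cm_update(0.180)            # a sample from connection B, same receiver
srtt, rttvar = est.cm_query()
```

Both connections then read the same estimate, which reflects samples neither would have collected alone.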

Data acknowledgements. On arrival of an ACK 
for new data, the TCP sender calls cm_update() to in- 
form the CM of a successful transmission. Duplicate ac- 
knowledgements cause TCP to check its dupack count 
(dup_acks) and take appropriate action (as per TCP 
semantics). If dup_acks < 3, then TCP does nothing. 
If dup_acks == 3, then TCP assumes a packet loss, 
and calls cm_update() to inform the CM of the loss. TCP 
also enqueues a retransmission of the lost segment and 
calls cm_request(). If dup_acks > 3, TCP assumes 
that a segment reached the receiver and caused this 
ACK to be sent. It therefore calls cm_update(). When 
the TCP retransmission timer expires, the sender identifies 
that a segment has been lost and calls cm_update() 
with the CM_PERSISTENT option set to signify the occurrence 
of persistent congestion to the CM. TCP also 
enqueues a retransmission of the lost segment and calls 
cm_request(). 
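
The acknowledgement rules above amount to a small decision table, sketched here in Python. The returned strings are illustrative labels for the calls TCP would make, not real CM signatures.

```python
def on_ack(dup_acks):
    """Return the actions taken for an ACK, given the duplicate-ACK
    count after processing it (0 means the ACK covered new data)."""
    if dup_acks == 0:
        return ["cm_update(success)"]
    if dup_acks < 3:
        return []  # too early to infer a loss
    if dup_acks == 3:
        return ["cm_update(loss)", "enqueue_retransmit", "cm_request"]
    # Beyond three, each duplicate means another segment arrived.
    return ["cm_update(success)"]

def on_retransmit_timeout():
    """A timeout signals persistent congestion to the CM."""
    return ["cm_update(loss, CM_PERSISTENT)", "enqueue_retransmit", "cm_request"]

trace = [on_ack(n) for n in (0, 1, 2, 3, 4)]
```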

TCP/CM Implementation. The integration of 
TCP and the CM required less than 100 lines of changes 
to the existing TCP code, demonstrating both the flexi- 
bility of the CM API and the low programmer overhead 
of implementing a complex protocol with the Conges- 
tion Manager. 

3.3 Congestion-controlled UDP sockets 

The CM also provides congestion-controlled UDP sock- 
ets. They provide the same functionality as standard 
Berkeley UDP sockets, but instead of immediately send- 
ing the data from the kernel packet queue to lower lay- 
ers for transmission, the buffered socket implementa- 
tion makes calls to the API exported by the CM inside 
the kernel and gets callbacks from the CM. When a 
CM UDP socket is created, it is bound to a particular 
flow. Later, when data is added to the packet queue, 
cm_request() is called on the flow associated with the 
socket. When the CM schedules this flow for transmis- 
sion, it calls udp_ccappsend() in the CM UDP mod- 
ule. This function transmits one MTU from the packet 
queue, and schedules transmission for any remaining 
packets. The in-kernel implementation of the CM UDP 
API adds no additional data copies or queue structures, 
and all standard UDP options are supported. Modify- 
ing existing applications to use this API requires only 
providing feedback to the CM, and setting a socket op- 
tion on the socket. 

A typical client of the CM UDP sockets will behave 
as follows, after its usual network socket initialization: 

    flow = cm_open(dst, port)
    setsockopt(flow, ..., CM_BUF)
    loop:
        <send data on flow>
        <receive data acknowledgements>
        cm_update(flow, sent, received, ...)

3.4 Streaming Layered Audio and Video 

Streaming layered audio or video applications that have 
a number of discrete rates at which they can transmit 
data are well-served by the CM rate callbacks. Instead 
of requiring a comparatively expensive notification for 
each transmission, these applications are instead noti- 
fied only in the rare event that their network conditions 
change significantly. Layered applications open their 
usual UDP socket and call cm_open() to obtain a control 
socket. They operate in their own time event loop while 
listening for status changes on their control socket, or 
via a SIGIO, depending on their implementation. They 
use cm_thresh() to inform the CM about changes for 
which they should receive callbacks. 
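
A minimal sketch of threshold-based rate callbacks follows, in Python. It assumes the thresholds are multiplicative factors relative to the last reported rate, which the text does not specify, and the layer rates are invented for the example.

```python
class RateNotifier:
    """Deliver a rate callback only when the rate has changed by more
    than the configured factors since the last report (assumed
    semantics for cm_thresh)."""

    def __init__(self, callback, down=0.5, up=2.0):
        self.callback = callback
        self.down, self.up = down, up
        self.last_reported = None

    def rate_changed(self, rate):
        last = self.last_reported
        if last is None or rate <= last * self.down or rate >= last * self.up:
            self.last_reported = rate
            self.callback(rate)

LAYER_RATES = [64_000, 128_000, 256_000, 512_000]  # assumed cumulative bits/s

def pick_layer(rate):
    """Highest layer whose cumulative rate fits; layer 0 is the floor."""
    layer = 0
    for i, needed in enumerate(LAYER_RATES):
        if needed <= rate:
            layer = i
    return layer

chosen = []
notifier = RateNotifier(lambda r: chosen.append(pick_layer(r)))
for r in (100_000, 120_000, 150_000, 300_000, 280_000, 90_000):
    notifier.rate_changed(r)
```

Of the six rate samples, only three cross a threshold, so the application is called back three times rather than once per sample.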



3.5 Real-time Adaptive Applications 

Applications that desire to control which data is being 
sent in real-time (i.e., those that do not want any buffering 
inside the kernel) use the request/callback API provided 
by the CM. On a callback from the CM for transmission 
of data, they may use cm_query() to discover the 
network characteristics and adapt their content based 
on that. Other servers may simply wish to send the 
most up-to-date content possible, and so will defer their 
data collection until they know they can send it. The 
rough sequence of CM calls that are made to achieve 
this in the application are: 

    flow = cm_open(dst)
    cm_request(flow)
    <receive cmapp_send() callback from libcm>
    cm_query(flow, ...)
    <send data>
    cm_notify(flow, amount of data sent)
    <receive data acks>
    cm_update(flow, sent, lost, ...)

Other options exist for applications that wish to ex- 
ploit the unique nature of their network utilization to 
reduce the overhead of using the services of the Congestion 
Manager. We discuss one such option below, in describing 
how we adapted the vat interactive audio 
application to use the CM. 

3.6 Interactive Real-time Audio 

The vat application provides a constant bit-rate source 
of interactive audio. Its inability to downsample its au- 
dio reduces the avenues it has available for bandwidth 
adaptation. Therefore, the best way to make vat behave 
in a network-friendly and backwards compatible man- 
ner is to preemptively drop packets to match the avail- 
able network bandwidth. There are, of course, compli- 
cations. Network applications experience two types of 
variation in available network bandwidth: long term 
variations due to changes in actual bandwidth, and 
short term variations due to the probing mechanisms 
of the congestion control algorithm. Short-term varia- 
tion is typically dealt with by buffering. Unfortunately, 
buffering, especially FIFO buffering with drop-tail be- 
havior, the de-facto standard for kernel buffers and net- 
work router buffers, can result in long delay and signif- 
icant delay variation, both of which are detrimental to 
vat's audio quality. Therefore, vat needs to act like an 
ALF application, managing its own buffer space with 
drop-from-head behavior when the queue is full. 

The resulting architecture is detailed in figure 2. 
The input audio stream is first sent to a policer, which 
provides long-term adaptation via preemptive packet 
dropping. The policer outputs into the application level 
buffer, which can be configured in various sizes and 
drop policies. This buffer feeds into the kernel buffer 
on-demand as packets are available for transmission. 
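
The policer and application-level buffer can be sketched as follows, in Python. This is an illustration of the two mechanisms only: the credit-based policer and the capacity of three are assumptions, not vat's actual policy.

```python
from collections import deque

class DropFromHeadBuffer:
    def __init__(self, capacity):
        # A bounded deque discards from the opposite end on append,
        # so appending on the right drops the oldest packet.
        self.queue = deque(maxlen=capacity)

    def enqueue(self, packet):
        self.queue.append(packet)

    def dequeue(self):
        return self.queue.popleft() if self.queue else None

def police(packets, keep_fraction):
    """Preemptively drop packets so that roughly keep_fraction survive."""
    kept, credit = [], 0.0
    for p in packets:
        credit += keep_fraction
        if credit >= 1.0:
            credit -= 1.0
            kept.append(p)
    return kept

buf = DropFromHeadBuffer(capacity=3)
for seq in police(range(10), keep_fraction=0.5):
    buf.enqueue(seq)
remaining = list(buf.queue)
```

With half of ten packets policed away and room for three, the buffer ends up holding the three most recent survivors, which is the behavior interactive audio wants.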

Figure 3. Comparing throughput vs. loss for 
TCP/CM and TCP/Linux. Rates are for a 10Mbps 
link with a 60ms RTT. 



4 Experiments 

This section describes several experiments that quantify 
the costs and benefits of our CM implementation. Our 
experiments show that using the Congestion Manager in 
the kernel has minimal costs, and that even the worst- 
case overhead of the request/callback user-space API is 
acceptably small. 

Performance tests were performed on the Utah Net- 
work Testbed [21] using 350MHz Intel Pentium II 
processors, 128MB PC100 ECC SDRAM, and Intel 
EtherExpress Pro/100B Ethernet cards, connected via 
100Mbps Ethernet through an Intel Express 510T 24-port 
10/100Mbps Switch, with Dummynet channel simulation. 
CM tests were run with Linux 2.2.9, with Linux 
and FreeBSD clients. 

To ensure the proper behavior of a flow, the con- 
gestion control algorithm must behave in a "TCP- 
compatible" [7] manner. The CM implements a TCP- 
style window-based AIMD algorithm with slow start. 
It shares bandwidth between eligible flows in a round- 
robin manner with equal weights on the flows. 
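
A simplified Python sketch of the two mechanisms named above: a window-based AIMD controller with slow start, and round-robin scheduling with equal weights. The initial window, the slow-start threshold, and the class and function names are assumptions for illustration, not the CM's actual parameters.

```python
from collections import deque

class AimdWindow:
    def __init__(self, ssthresh=8):
        self.cwnd = 1          # congestion window, in packets
        self.ssthresh = ssthresh

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1                 # slow start: doubles per RTT
        else:
            self.cwnd += 1 / self.cwnd     # additive increase: ~1 per RTT

    def on_loss(self):
        self.ssthresh = max(self.cwnd / 2, 1)
        self.cwnd = self.ssthresh          # multiplicative decrease

def round_robin(requests, grants):
    """Hand out `grants` transmission slots across flows with pending
    requests, one slot per flow per turn."""
    pending = deque(f for f, n in requests.items() if n > 0)
    left = dict(requests)
    order = []
    while pending and len(order) < grants:
        flow = pending.popleft()
        order.append(flow)
        left[flow] -= 1
        if left[flow] > 0:
            pending.append(flow)
    return order

w = AimdWindow(ssthresh=4)
for _ in range(3):
    w.on_ack()           # 1 -> 4 in slow start
w.on_loss()              # halve to 2
order = round_robin({"a": 2, "b": 1, "c": 3}, grants=5)
```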

Figure 4. 100Mbps TCP throughput comparison. 
Note that the absolute difference in the worst case be- 
tween the Congestion Manager and the native TCP is 
only 0.5%. 

Figure 5. CPU overhead comparison between 
TCP/Linux and TCP/CM. For long connections, the 
CPU overhead converges to slightly under 1% for the 
unoptimized implementation of the CM. 



Figure 3 shows the throughput achieved by 
the Linux TCP implementation (TCP/Linux) and 
TCP with congestion control performed by the CM 
(TCP/CM). There are slight algorithmic differences be- 
tween the two which lead to the small differences in ob- 
served throughput, but the congestion control provided 
by the Congestion Manager still behaves in a network- 
friendly manner. 

4.1 Kernel Overhead 

To measure the kernel overhead, we measured the 
CPU and throughput differences between the optimized 
TCP/Linux and TCP/CM. The midrange machines 
used in our test environment are sufficiently powerful 
to saturate a 100Mbps Ethernet with TCP traffic. 

There are two components to the overhead imposed 
by the congestion manager: the cost of performing accounting 
as data is exchanged on a connection, and a 
one-time connection setup cost for creating CM data 
structures. A microbenchmark of the connection establishment 
time of TCP/CM vs. TCP/Linux indicates 
that there is no appreciable difference in connection 
setup times. 

We used long (megabytes to gigabytes) connections 
with the ttcp utility to determine the long-term costs 
imposed by the congestion manager. The impact of the 
CM on extremely long term throughput was negligible: 
in a 1 gigabyte transfer, the congestion manager 
achieved the same performance (91.6 Mbps) as native 
Linux. On shorter runs, the throughput of the CM diverged 
slightly from that of Linux, but only by 0.5%. 
The throughput rates are shown in figure 4. The dif- 
ference is due to the more conservative initial window 
opening in the CM, not CPU overhead. 

Because both implementations are able to saturate 
the network connection, we looked at the CPU overhead 
incurred during these transmissions to determine 
the steady-state overhead imposed by the Congestion 
Manager. Figure 5 compares the CPU utilization of 
TCP/Linux and TCP/CM. 

With long-running connections, the CPU overhead 
converged to slightly less than a 1% difference between 
the CM and the non-CM kernels. With shorter connec- 
tions, the noise in the CPU overhead was too large for 
statistically significant conclusions, but the processor 
utilizations all lie within the same small ranges. 

4.2 User-space API Overhead 

The overhead incurred by our adaptation API is primarily 
in the form of extra user/kernel boundary crossings 
(select() and ioctl() system calls). A request/callback 
client is notified for each packet it may 
send, and so represents the worst case overhead. The 
buffered send API has overhead similar to that of in- 
kernel TCP, and the overhead imposed by the rate 
change notification API varies with the frequency with 
which rates change, but in most situations is quite low. 
To measure the API overhead, we compared the 
CPU utilization of the ttcp benchmark using the ker- 
nel TCP to a user-space program sending UDP packets 
scheduled by the CM. ttcp was modified to force the 
kernel to use a small segment size, resulting in transmis- 
sion behavior identical to the user-space program. The 
test scenario was two machines connected via a 5Mbps 
link with a 40ms round trip delay. (These test parame- 
ters were chosen to minimize the impact that the delay 
software would have on accuracy.) The external behav- 
ior of the two programs is very similar — both send a 
series of packets with TCP-style congestion avoidance. 
By removing competing traffic, overhead due to TCP 
retransmissions was minimized. 

Figure 6. API Overhead. When sending small packets, 
the API imposes measurable overhead, but even with 
the relatively modest processors in our test setup, the 
request/callback API is able to saturate a 100Mbps link 
with 900 byte and larger packets. 

Figure 7. Sharing TCP state: The client requests 
the same file 8 times with a 500ms delay between requests. 
By sharing congestion information, the CM-enabled 
server is able to provide faster service for subsequent 
requests. 

Using the request/callback API imposed between 
2.5% and 18.5% additional CPU load, varying with 
packet size. To put this in perspective, with the modest 
350MHz processors in our testbed, our application 
was still able to saturate a 100Mbps Ethernet link when 
sending packets of 900 bytes or larger. 

Note that a real application that desired this be- 
havior would be better served by using the buffered 
send API, which has much less overhead. The re- 
quest/callback API is designed for applications that 
need last-minute control of their packet scheduling, for 
whom the additional overhead is a worthwhile trade-off 
for increased functionality. 

4.3 Benefits of Sharing 

One benefit of integrating congestion information with 
the CM is immediately clear. A client that sequen- 
tially fetches files from a webserver with a new TCP 
connection each time loses its prior congestion infor- 
mation, but with concurrent connections with the CM, 
the server is able to use this information to start subse- 
quent connections with more accurate congestion win- 
dows. Figure 7 shows a test we performed across the 
vBNS between MIT and the University of Utah, where 
an unmodified (non-CM) client performed 8 retrievals 
of the same 128k file with a 500ms delay between re- 
trievals, resulting in a 20% improvement in the trans- 
fer time for the later requests. (Other file sizes and 
delays yield similar results.) This pattern of multiple 
connections is still quite common in webservers despite 
the adoption of persistent connections: Many browsers 
and servers open 4 concurrent connections to a server, 
and many servers do not support persistent connections. 
The higher initial time is due to the CM's more conservative 
initial window opening. 

4.4 Adaptive Applications 

In this section, we demonstrate some of the network 
adaptive behaviors enabled by the CM. 

As noted earlier, applications that require tight 
control over data scheduling use the request/callback 
(ALF) API, and are notified by the CM as soon as they 
can transmit data. The behavior of an adaptive lay- 
ering application run across the Internet between MIT 
and the University of Utah using this API is shown 
in figure 8. This application chooses a layer to trans- 
mit based upon the current rate, but sends packets as 
rapidly as possible to allow its client to buffer more 
data. We see that the CM is able to provide sufficient 
information to the application to allow it to adapt prop- 
erly to the network conditions. 

For self-clocked applications that base their trans- 
mitted data upon the bandwidth to the client (such as 
conventional layered audio servers), the CM rate call- 
back mechanism provides a low-overhead mechanism 
for adaptation, and allows clients to specify thresholds 
for the notification callbacks. Figure 9 shows applica- 
tion adaptation using rate callbacks for a connection 
between MIT and the University of Utah. Here, the application 
decides which of the four layers it should send 
based on notifications from the CM about rate changes. 

From figures 8 and 9, we see from the increased oscil- 
lation rate in the transmitted layer that the ALF appli- 
cation is more responsive to smaller changes in available 
bandwidth, whereas the rate callback application relies 
occasionally on short-term kernel buffering for smoothing. 
There is an overhead vs. functionality trade-off 
involved in the decision of which API to use, given the 
higher overhead of the ALF API, but applications face 
a more important decision about the behavior they wish 
to exhibit. 

Figure 8. Adaptive layered application using request 
callback (ALF) API 

Figure 9. Adaptive layered application using rate callback 
API 

Figure 10. Adaptive layered application using rate 
callback API with delayed feedback 

Some applications may be concerned about the over- 
head from receiver feedback. To mitigate this, an ap- 
plication may delay sending feedback; we see this in a 
minor and inflexible way with TCP delayed acks. In 
figure 10, we see that delaying feedback to the CM 
causes burstiness in the reported bandwidth. Here, the 
feedback by the receiver was delayed by min(500 acks, 
2000ms). The initial slow start is delayed by 2s wait- 
ing for the application, then the update causes a large 
rate change. Once the pipe is sufficiently full, 500 acks 
come relatively rapidly, and the normal, though bursty, 
non-timeout behavior resumes. 
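
The delay rule used in this experiment, feedback held until 500 acks have accumulated or 2000ms have passed, whichever comes first, can be sketched in Python as below. The class is hypothetical, and for brevity it checks the timer only when an ack arrives rather than running a real timer.

```python
class DelayedFeedback:
    """Batch receiver feedback until max_acks have accumulated or
    max_delay seconds have passed, whichever comes first."""

    def __init__(self, max_acks=500, max_delay=2.0):
        self.max_acks, self.max_delay = max_acks, max_delay
        self.count = 0
        self.first_time = None
        self.reports = []   # (time, acks) pairs that would go to cm_update

    def on_ack(self, now):
        if self.count == 0:
            self.first_time = now   # start of a new batch
        self.count += 1
        timed_out = now - self.first_time >= self.max_delay
        if self.count >= self.max_acks or timed_out:
            self.reports.append((now, self.count))
            self.count = 0

fb = DelayedFeedback()
# Early in slow start: few acks, so the 2 s limit is reached first.
for t in (0.0, 0.5, 1.0, 2.0):
    fb.on_ack(t)
# Later, with a full pipe: 500 acks arrive within half a second.
for i in range(500):
    fb.on_ack(3.0 + i * 0.001)
```

The first report carries only four acks after a two-second wait, while the second carries a full batch of 500, which mirrors the delayed start and later bursty updates described above.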



5 Discussion 

We have shown several benefits of integrated flow man- 
agement and the adaptation API, and have explored the 
design features that make the API easy to use. This sec- 
tion describes an optimization useful for busy servers, 
and discusses some drawbacks of the current CM archi- 
tecture. 

Optimizations. Servers with large numbers of con- 
current clients are often very sensitive to the overhead 
caused by multiple kernel boundary crossings. To re- 
duce this overhead, we can batch several sockets into 
the same cm_request call with the cm_bulk_request 
call, and its corresponding bulk query, notify, and 
update calls. 

By multiplexing control information for many sockets 
on each CM call, the overhead associated with multiple 
kernel crossings is avoided, at the expense of managing 
more complicated data structures for the CM interface. 
The bulk querying is already performed in libcm 
when multiple flows are ready during a single ioctl to 
determine which flows can send data, but this completes 
the interface. 
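
The saving from batching can be made concrete with a toy counter, sketched in Python. FakeKernel is a hypothetical stand-in that counts one boundary crossing per call; the real calls and their argument formats are not shown in the text.

```python
class FakeKernel:
    """Counts user/kernel crossings; each call costs exactly one."""

    def __init__(self):
        self.crossings = 0
        self.requested = []

    def cm_request(self, flow):
        self.crossings += 1
        self.requested.append(flow)

    def cm_bulk_request(self, flows):
        self.crossings += 1          # one crossing for the whole batch
        self.requested.extend(flows)

flows = list(range(100))
one_by_one, batched = FakeKernel(), FakeKernel()
for f in flows:
    one_by_one.cm_request(f)
batched.cm_bulk_request(flows)
```

Both kernels end up with the same 100 requests, but the batched path crosses the boundary once instead of 100 times.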

Trust issues. Because our goal was an architec- 
ture that did not require modifications to receivers, we 
devised a system where applications provide feedback 
using the cm_update() call. The consequence of this is 
that there is a potential for misuse, due to bugs or mal- 
ice. For example, the CM client could repeatedly mis- 
inform the CM about the absence of congestion along a 
path and obtain higher bandwidth. We do not believe 
that this in any way increases the vulnerability of the 
Internet to such problems because this is easy to do today. 
More important is the possibility that an application 
falsely reports congestion along a path and prevents 
another flow on the same macroflow, but belonging to 
a different process, from making progress. While this is 
a possibility, we believe that the incentive for such be- 
havior is small, since the malicious flow would also get 
low performance. Furthermore, we intend to explore 
macroflow-splitting algorithms that dynamically adjust 
the composition of macroflows based on flows sharing 
common performance. Another form of abuse is when 
a malicious receiver attempts to defeat congestion control, 
as pointed out in [37]. The techniques proposed there 
can be used in the CM as well. 

Macroflow construction. Some Internet Service 
Providers deploy network-layer load balancers that 
route packets belonging to different flows along differ- 
ent paths inside the network, which might violate the 
granularity assumptions made in the CM. However, un- 
less such balancers are careful, they would route pack- 
ets on the same TCP connection along different paths, 
which would confound its loss recovery and reordering- 
detection schemes. We observe that as long as such load 
balancers work on the granularity of host addresses, 
there will be no problems for the CM to tackle. 

When differentiated services start being deployed, 
the CM would have to reconsider the default choice of a 
macroflow. We expect to be able to gain some benefit by 
including the IP differentiated-services field in deciding 
the composition of a macroflow. 
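
As a sketch of the two grouping choices discussed here, the Python below builds macroflows keyed by destination host, and optionally by the differentiated-services value as well. The flow records and field names are invented for the example.

```python
from collections import defaultdict

def build_macroflows(flows, use_dscp=False):
    """Group flows (dicts with 'id', 'dst', 'dscp') into macroflows."""
    groups = defaultdict(list)
    for f in flows:
        # Default key is the destination host; optionally refine it
        # with the differentiated-services value.
        key = (f["dst"], f["dscp"]) if use_dscp else f["dst"]
        groups[key].append(f["id"])
    return dict(groups)

flows = [
    {"id": 1, "dst": "10.0.0.1", "dscp": 0},
    {"id": 2, "dst": "10.0.0.1", "dscp": 46},
    {"id": 3, "dst": "10.0.0.2", "dscp": 0},
]
by_host = build_macroflows(flows)
by_host_and_dscp = build_macroflows(flows, use_dscp=True)
```

Flows 1 and 2 share congestion state under the per-host grouping but are separated once the service class is part of the key.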

Finally, we observe that remote LANs are not often 
the bottleneck for an outside communicator. As sug- 
gested in [42, 36] among others, aggregating congestion 
information about remote sites with a shared bottleneck 
and sharing this information with local peers may ben- 
efit both users and the network itself. A macroflow may 
thus be extended to cover multiple destination hosts, all 
behind the same shared bottleneck link. The efficient 
determination of such bottlenecks remains an open re- 
search problem. 

6 Related work 

Designing adaptive network applications has been an 
active area of research for the past several years. In 
1990, Clark and Tennenhouse [10] advocated the use 
of application-level framing (ALF) for designing net- 
work protocols, where protocol data units are chosen 
in concert with the application. Using this approach, 
an application can have a greater influence over decid- 
ing how loss recovery occurs than in the traditional lay- 
ered approach. The ALF philosophy has been used with 
great benefit in the design of several multicast transport 
protocols including the Real-time Transport Protocol 
(RTP) [38], frameworks for reliable multicast [13, 33], 
and Internet video [23, 35]. 

Adaptation APIs in the context of mobile informa- 
tion access were explored in the Odyssey system [26]. 
Implemented as a user-level module in the NetBSD op- 
erating system, Odyssey provides API calls by which 
applications can manage system resources, with upcalls 
to applications informing them when changes occur in 
the resources that are available. In contrast, our CM 
system is implemented in-kernel since it has to manage 
and share resources across applications (e.g., TCP) that 
are already in-kernel. This necessitates a different ap- 
proach to handling application callbacks. In addition, 
the CM approach to measuring bandwidth and other 
network conditions is tied to the congestion avoidance 
and control algorithms, as compared to the instrumen- 
tation of the user-level RPC mechanism in Odyssey. 
We believe that our approach to providing adaptation 
information for bandwidth, round-trip time, and 
loss rate complements Odyssey's management of disk 
space, CPU, and battery power. 

The CM system uses application callbacks or up- 
calls as an abstraction, an old idea in operating systems. 
Clark describes upcalls in the Swift operating system, 
where the motivation is a lower layer of a protocol stack 
synchronously invoking a higher-layer function across a 
protection boundary [8]. The Mach system used the notion 
of ports, a generic communication abstraction for 
fast inter-process communication (IPC). POSIX speci- 
fies a standard way of passing "soft real-time signals" 
that can be used to send a notification to a user-level 
process, but it restricts the amount of data that can be 
communicated to a 32-bit quantity. 

Event delivery abstractions for mobile computing 
have been explored in [1], where "monitored" events 
are tracked using polling and "triggered" events (e.g., 
PC card insertion) are notified using IPC. This work 
defines a language-level mechanism based on C++ ob- 
jects for event registration, delivery, and handling. This 
system is implemented in Mach, using its ports as the 
abstraction. 

Our approach is to use a select() call on a control 
socket to communicate information between kernel and 
user-level. The recent work of Banga et al. to improve 
the performance of this type of event delivery can be 
used to further improve our performance. 

The Microsoft Winsock implementation is largely 
callback-based, but here callbacks are implemented as 
conventional function calls since Winsock is a user-level 
library within the same protection boundary as the ap- 
plication [31]. The main reason we did not implement 
the CM as a user-level daemon was because TCP is al- 
ready implemented in-kernel in most UNIX operating 
systems, and it is important to share network informa- 
tion across TCP flows. 

Quality-of-service (QoS) interfaces have been ex- 
plored in several operating systems, including Neme- 
sis [15]. Like the exokernel approach [19] and SPIN [6], 
Nemesis enables applications to perform as much of the 
processing as possible on their own using application- 
specific policy, supported by a different set of operat- 
ing system abstractions from UNIX. Whereas Neme- 
sis treats local network-interface bandwidth as the re- 
source to be managed, we take a more end-to-end ap- 
proach of discovering the end-to-end performance to dif- 
ferent end-hosts, enabling sharing across common net- 
work paths. Furthermore, the API exported by Nemesis 
is useful for applications that can make resource reser- 
vations, while the CM API provides information about 
network conditions. 

Multiple concurrent streams can cause problems for 
TCP congestion control. First, the ensemble of flows 
probes more aggressively for bandwidth than a single 
flow. Second, upon experiencing congestion along the 
path, only a subset of the connections usually reduce 
their window. Third, these flows do not share any in- 
formation between each other. While we propose a gen- 
eral solution to these problems, application-specific so- 
lutions have been proposed in the literature. Of partic- 
ular importance are approaches that multiplex several 
logically distinct streams onto a single TCP connection 
at the application level, including Persistent-connection 
HTTP (P-HTTP [28], part of HTTP/1.1 [11]), the Ses- 
sion Control Protocol (SCP) [39], and the MUX pro- 
tocol [14]. Unfortunately, these solutions suffer from 
two important drawbacks. First, because they are 
application-specific, they require each class of applica- 
tions (Web, real-time streams, file transfers, etc.) to 
reimplement much of the same machinery. Second, they 
cause an undesirable coupling between logically different 
streams: if packets belonging to one of the streams 
are lost, another stream could stall even if none of its 
packets are lost because of the in-order "linear" delivery 
forced by TCP. Independent data units belonging 
to different streams are no longer independently processible 
and the parallelism of downloads is often lost. 

7 Conclusion 

The CM system enables applications to obtain an un- 
precedented degree of control over what they can do 
in response to different network conditions. It incorpo- 
rates robust congestion control algorithms, freeing each 
application from having to reimplement them. It ex- 
poses a rich API that allows applications to adapt their 
transmissions at a fine-grained level, and allows the ker- 
nel and applications to integrate congestion information 
across flows. 

The implementation of the CM provides easy-to-use 
facilities for congestion control and integrated flow man- 
agement. Our performance evaluation shows that it is 
possible for an operating system to incorporate these 
services with low overhead. 